Using machine learning to forecast domestic homicide via police data and super learning

We explore the feasibility of using machine learning on a police dataset to forecast domestic homicides. Existing forecasting instruments based on ordinary statistical methods focus on non-fatal revictimization, produce outputs with limited predictive validity, or both. We implement a "super learner," a machine learning paradigm that incorporates roughly a dozen machine learning models to increase the recall and AUC of forecasting beyond what any one model achieves. We purposely incorporate police records only, rather than multiple data sources, to illustrate the practical utility of the super learner, as additional datasets are often unavailable due to confidentiality considerations. Using London Metropolitan Police Service data, our model outperforms all extant domestic homicide forecasting tools: the super learner detects 77.64% of homicides, with a precision score of 18.61% and a 71.04% Area Under the Curve (AUC), which, collectively, are assessed as "excellent." Implications for theory, research, and practice are discussed.


Supplemental Introductory Materials

Estimating Domestic Homicides in London
The London Metropolitan Police Service ("MPS") does not report cases of domestic homicide via its publicly available datasets.1 However, MPS did report that it experienced an average of 132.5 homicides per year during the four-year period between 2018 and 2022.[3][4][5] When applying these macro-level statistics to the MPS annual homicide cases, we estimate that 25 of these 132.5 homicides are domestic.

Existing Domestic Homicide Forecasting Tools
Domestic homicide forecasts are often conducted via screening tools.6 These screening tools do not use machine learning; they are typically questionnaires administered to a victim of domestic abuse, and they assign the victim a risk category (e.g., high, medium, or low). In a state-of-the-art literature review, Graham and colleagues identified 18 different screening tools administered in eight countries by various professional groups (law enforcement, social workers, academics, researchers, etc.).7 Most screening tools did not forecast domestic homicide; they predicted whether an individual would suffer a future, non-lethal domestic abuse incident.8 However, these tools cannot be used in this study because factors that predict domestic abuse may differ from those that predict domestic homicide.9 All abuse is damaging, but there are variations between the different types of assaults, threats, and violence that perpetrators inflict on their victims. Severity scores can be used as indexes that enable a more nuanced classification of offenders. These can then be used to create typologies of domestic abuse pathologies, such as coercion, pathological behavior, sadism, or homicidal behavior.10,11 Critically, research shows that there are phenomenological differences between domestic homicide and non-lethal domestic abuse in terms of the "emotions that trigger it, the circumstances that led up to it, and the state of mind that characterizes it."12 Therefore, predicting non-lethal repeat victimization is not necessarily similar to predicting lethal domestic abuse.
Still, we identified two instruments that attempt to forecast near-fatal domestic abuse, that is, domestic abuse incidents in which the victim was nearly killed: the Danger Assessment and the Lethality Assessment Programme, both of which are described below.
The Danger Assessment. The first tool that forecasts near-fatal domestic abuse is the Danger Assessment. Developed by Campbell,13 this questionnaire was published in Advances in Nursing Science and is intended to be administered in a medical setting. It consists of 20 weighted questions and assigns the respondent a risk category. The survey is calibrated for female victims of domestic abuse, so it may not be fitting for male victims (e.g., the instrument asks about strangulation, abuse during pregnancy, etc.).14 It is difficult to measure the performance of this tool because the original validation study did not report false discovery or precision rates.14 Nearly all of the other studies in Graham et al.'s review report these crucial model performance indicators, and without them, it is difficult to determine how well the Danger Assessment performs.7 Fortunately, a related study found that a modified Danger Assessment can predict 83% of near-fatal domestic abuse with a false discovery rate of 75%.15 In other words, only one of every four women the model identified as being at exceptionally high risk would actually die at the hands of her abuser. Other studies have obtained similar results when using this tool to predict less extreme manifestations of domestic abuse,[16][17][18] suggesting these findings have been somewhat replicated.

Lethality Assessment Programme. The second forecasting tool is a modified version of the Danger Assessment created for law enforcement by Messing and colleagues.19,20 Specifically, the US Department of Justice funded the original validation studies, and they were conducted through

Dataset Description
Descriptive statistics that describe the categorical (Table S1) and continuous (Table S2) features appear below. Information concerning each feature can be inferred from its name, except for the initial risk assessment. The initial risk assessment was conducted according to the guidelines outlined in the Domestic Abuse, Stalking, Harassment and Honour-Based Violence Assessment (DASH) programme. This was the most widely used domestic abuse risk assessment system in the United Kingdom.21 Specifically, the programme required officers to ask a series of questions of the victim of a suspected domestic abuse case. Officers would then rank a case as high, medium, or standard risk.22 DASH is not a forecasting instrument; it is simply meant to assess whether there is a risk of serious harm.22 Moreover, unlike previous forecasting instruments,7 it is not a quantitative risk-rating system: it is a heuristic risk-rating instrument. The questions are meant to inform an officer's intuition, but there are no numeric guidelines to guide officers in assigning a risk rating. For this reason, these risk ratings may vary considerably between officers and over time.

Preprocessing
Five essential data modifications were made to convert these features into a format conducive to machine learning: ordinal encoding, one-hot encoding, standardization, the removal of irrelevant features, and the removal of redundant features.23 First, all ordinal features were converted to numeric values representing their natural order. For example, the risk level of a domestic abuser was encoded as high, medium, standard, or unknown, so these categories were converted to the values 3, 2, 1, and 0. Second, nominal categorical features were one-hot encoded into binaries.

Third, the dataset had three continuous features: age, crime count, and domestic violence count. Metric variables are typically standardized via z-scores in machine learning,23 so age was standardized accordingly. Crime counts and domestic violence counts, however, were not standardized. Most people in the dataset had never committed a crime or an incident of domestic violence; even those who had typically committed only a couple of episodes at most. Converting these values to z-scores might obfuscate their meaning, given their exceptionally low variance, so they were left unmodified and treated as ordinal. Fourth, the MPS assigned an arbitrary case ID, which was dropped as irrelevant. Fifth, there were two redundant features: crime prevalence and domestic violence prevalence. Both measure whether someone committed a crime or domestic violence incident; this information is already captured by the crime count and domestic violence count features, so both prevalence features were dropped.
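To make the five preprocessing steps concrete, the following is a minimal sketch in Python with pandas and scikit-learn. The column names and values are hypothetical stand-ins, not the actual MPS feature names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows with hypothetical column names; the real MPS features differ.
df = pd.DataFrame({
    "case_id": [101, 102, 103],
    "initial_risk_assessment": ["high", "standard", "unknown"],
    "relationship_type": ["partner", "ex-partner", "family"],
    "age": [34, 52, 27],
    "crime_count": [0, 2, 0],
    "dv_count": [1, 0, 0],
    "crime_prevalence": [0, 1, 0],
    "dv_prevalence": [1, 0, 0],
    "homicide": [0, 0, 1],
})

# 1) Ordinal encoding: map risk levels onto their natural order.
risk_order = {"high": 3, "medium": 2, "standard": 1, "unknown": 0}
df["initial_risk_assessment"] = df["initial_risk_assessment"].map(risk_order)

# 2) One-hot encode nominal categorical features into binaries.
df = pd.get_dummies(df, columns=["relationship_type"], dtype=int)

# 3) Standardize age only; the low-variance count features stay as-is (ordinal).
df["age"] = StandardScaler().fit_transform(df[["age"]])

# 4) Drop the irrelevant case identifier.
# 5) Drop prevalence features already captured by the count features.
df = df.drop(columns=["case_id", "crime_prevalence", "dv_prevalence"])
```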

Model Evaluation: Key Performance Indicators
Four key performance indicators were used to assess the different models: recall, precision, specificity, and Area Under the Curve (AUC). They were selected because they are used to estimate the predictive validity of the various instruments in both the machine learning literature and the domestic abuse screening tool literature.6,7,24 These key performance indicators were derived from the values of a confusion matrix, which is discussed below.
A confusion matrix (see Table S3 for an example) tabulates the counts of true positives, true negatives, false positives, and false negatives, i.e., the predictions the model got right and the predictions the model got wrong. Percentages can also be displayed alongside counts. These four measurements are the foundation for the key performance indicators described below.
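To make the confusion matrix concrete, here is a minimal sketch using scikit-learn on toy labels (the arrays are illustrative, not MPS data):

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth labels (1 = homicide, 0 = non-homicide) and model predictions.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# scikit-learn returns the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=2, FP=1, TN=4, FN=1
```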
Recall/Sensitivity. Recall is the proportion of homicide cases that the model could flag. For example, if there were 100 homicide cases and the model had a recall score of 0.5, then the model could identify 50 of the 100 cases. Recall is also known as the "true positive rate," and its equation is given by Powers:24

(S1) $\mathrm{Recall} = \mathrm{TPR} = \dfrac{TP}{TP + FN}$

Precision. Precision refers to the proportion of cases that the model predicted to result in a homicide that actually occurred. For example, if a model had a precision rate of .25, then for every four cases the model flagged as being at risk of homicide, only one will actually be a homicide. The other three cases are false discoveries. For this reason, precision is also known as the mathematical complement of the false discovery rate, and its equation is given by Powers:24

(S2) $\mathrm{Precision} = 1 - \mathrm{FDR} = \dfrac{TP}{TP + FP}$

Specificity/True Negative Rate. Specificity is the proportion of cases that the model predicted to never result in a homicide that was correct. For example, if a model had 100 non-homicide cases and a specificity of .75, then it would correctly interpret 75 of those 100 cases as non-homicides. It is also known as the true negative rate. It is the mathematical complement of the false positive rate, and its equation is:

(S3) $\mathrm{Specificity} = \mathrm{TNR} = 1 - \mathrm{FPR} = \dfrac{TN}{TN + FP}$

Area Under Curve (AUC). The AUC is a metric traditionally used in assessing medical tests that has become somewhat ubiquitous in machine learning.26,27 To provide context, each model faces a fundamental trade-off between true positives and false positives. For example, a model may be able to detect more homicides (true positives) if it lowers the decision threshold needed to flag a homicide; a model can be configured to require 10% confidence to flag a homicide rather than 50% confidence. This decision would undoubtedly result in more homicide detections, but it would also result in more false positives, and hence the trade-off emerges. This trade-off occurs again if one were to raise the decision threshold (requiring 99% confidence to detect a homicide, for example), albeit with the consequences reversed.

This trade-off can be graphically illustrated through the Receiver Operating Characteristic (ROC) curve, which plots true positives against false positives across various decision thresholds.28 One can take the area under the curve (AUC) of the ROC graph to summarize the overall quality of this trade-off, where higher values indicate a better trade-off. AUC is occasionally used as a measure of overall performance, and it is calculated as follows, using the true positive and false positive rates defined in (S1) and (S3):24,29

(S4) $\mathrm{AUC} = \displaystyle\sum_{k=2}^{n} \dfrac{(\mathrm{FPR}_k - \mathrm{FPR}_{k-1})(\mathrm{TPR}_k + \mathrm{TPR}_{k-1})}{2}$

Note that FPR and TPR are vectors of false and true positive rates, respectively, across various decision thresholds, with n being their length. AbiNader et al. provide guidance on meaningful interpretation of the AUC score: a score of 1 reflects a perfect classification, 0.5 reflects a chance classification, 0 reflects a perfect misclassification, and a value of 0.71 or greater reflects an excellent model.6

A Note on Model Assessment: Decision Thresholds. The preceding discussion of the AUC implies that the key performance indicators (i.e., recall, precision, and specificity) depend strongly on the model's decision threshold, which can be configured. In other words, lowering the model's decision threshold would usually lead to a higher recall at the expense of specificity, and vice versa for raising the threshold. Thus, unless otherwise stated, all machine learning classifiers used a decision threshold set to 0.5. In other words, these models would predict a homicide as long as they were at least 50% confident. These confidence scores were probability-like estimates that ranged between 0 and 1, and they were obtained from models via the scikit-learn 'predict_proba' function.29 This assertion holds for both the initial models and the models produced from the super dataset.
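The computations above can be reproduced with scikit-learn. The sketch below uses synthetic data as a stand-in for the MPS dataset, applies the 0.5 decision threshold to 'predict_proba' outputs, and derives the four key performance indicators, including a manual trapezoidal AUC matching equation (S4). The classifier and data are illustrative assumptions, not the study's actual models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the MPS features and (rare) homicide label.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Probability-like confidence scores for the positive (homicide) class.
proba = clf.predict_proba(X_test)[:, 1]

# Apply the 0.5 decision threshold described above.
y_pred = (proba >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
recall = tp / (tp + fn)       # (S1) true positive rate
precision = tp / (tp + fp)    # (S2) complement of the false discovery rate
specificity = tn / (tn + fp)  # (S3) true negative rate

# AUC via the trapezoidal sum in (S4), checked against scikit-learn's value.
fpr, tpr, _ = roc_curve(y_test, proba)
auc_manual = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
auc_sklearn = roc_auc_score(y_test, proba)

print(recall, precision, specificity, auc_manual, auc_sklearn)
```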

Initial Model Creation
The ten machine learning classifiers that appear in the main text were used to produce models, and they were created via the following procedure. First, each classifier was trained and evaluated via ten-fold cross-validation with its default hyperparameters, as defined by the scikit-learn Python package.29 Second, the classifier was optimized for the MPS dataset via a procedure known as hyperparameter tuning. Details of the complete tuning procedure appear later in these supplemental materials and in Table S6.
Third, the hyperparameter tuning produced an optimized classifier, which was trained on the MPS dataset to produce an optimized model via ten-fold cross-validation. During this procedure, two models were produced: one from the default classifier (the preliminary model) and one from the optimized classifier (the optimized model). Typically, the preliminary model was disregarded while the optimized model was used to create the super dataset. In other words, the optimized model almost always performed significantly better than the preliminary model, so there was little need to use the worse-performing preliminary model to create the super dataset. However, if the preliminary model performed exceptionally well, it was also used. Each model's recall, precision, specificity, and AUC scores appear below in Table S4.
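As an illustration of the first step (preliminary models from default hyperparameters under ten-fold cross-validation), here is a minimal scikit-learn sketch. It uses synthetic data and only three of the classifiers; the study evaluated ten on the MPS dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Toy stand-in for the preprocessed MPS features and homicide label.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# Three classifiers for illustration; the study evaluated ten.
classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

metrics = ["recall", "precision", "roc_auc"]

# Ten-fold cross-validation with default hyperparameters (the preliminary models).
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=metrics)
    print(name, {m: scores[f"test_{m}"].mean().round(3) for m in metrics})
```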

Hyperparameter Tuning
The hyperparameter tuning was undertaken via an exhaustive grid search in a computationally powerful environment. Specifically, it was launched on a Google Cloud virtual machine with 60 Intel C2-series Cascade Lake CPUs, a 100 GB boot disk, and 240 GB of memory. The initial models (those trained on the MPS dataset) and the model candidates (those trained on the super dataset) underwent the same hyperparameter tuning procedure, with the only difference being their dataset. All hyperparameter tuning was undertaken via the scikit-learn package in Python, details of which can be found in both its documentation and in Table S6.29

Table S6. Classifiers with their default hyperparameters and the results of their hyperparameter tuning. Asterisk indicates that a classifier was trained with optimized hyperparameters taken earlier in the experiment.
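For illustration of the exhaustive grid-search procedure, the following is a minimal scikit-learn sketch. The parameter grid and classifier are assumptions chosen for brevity; the study's actual search spaces are those reported in Table S6.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the (preprocessed) MPS features and homicide label.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Illustrative grid only; the study's actual search spaces appear in Table S6.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "class_weight": [None, "balanced"],
}

# Exhaustive grid search scored by AUC, parallelized across available CPUs.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=10,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```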


Table S1. Descriptive statistics of categorical features and label in the MPS dataset. The label is marked with an asterisk (*).

Table S2. Descriptive statistics of continuous features in the MPS dataset.

Table S3. Example confusion matrix. Green shading indicates correct predictions, whereas red shading indicates incorrect predictions.

Table S4. Key performance indicators of all preliminary and optimized models. Asterisks indicate that the model was used in the creation of the super dataset. The Gaussian Naïve Bayes classifier had no meaningful hyperparameters to tune, thus only its preliminary model appears.

Table S5. Key performance indicators of all preliminary and optimized model candidates, i.e., models that were trained on the super dataset. Asterisks indicate that the model was selected as the super learner's ultimate model.
FPR vector of [1, .79, 0], and, using equation (S4), one gets an AUC score of .3745. This AUC score may be flawed: Messing et al. acknowledge that their model was not designed for alternate decision thresholds for near-fatal domestic abuse cases, making AUC a less-than-reliable metric. However, this AUC score is still crucial for the inter-model comparisons, so the estimate will still be used.